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Abstract 

Wc present a novel approach to learning an HMM whose outputs 
are distributed according to a parametric family. This is done by de- 
coupling the learning task into two steps: first estimating the output 
parameters, and then estimating the hidden states transition probabil- 
ities. The first step is accomplished by fitting a mixture model to the 
output stationary distribution. Given the parameters of this mixture 
model, the second step is formulated as the solution of an easily solv- 
able convex quadratic program. We provide an error analysis for the 
estimated transition probabilities and show they are robust to small 
perturbations in the estimates of the mixture parameters. Finally, we 
support our analysis with some encouraging empirical results. 



1 Introduction 

Hidden Markov Models (HMM) are a standard tool in the modeling and 
analysis of time series with a wide variety of applications. When the number 
of hidden states is known, the standard method for estimating the HMM 
parameters from given observed data is the Baum- Welch algorithm [Baum 
et ah, 1970]. The latter is known to suffer from two serious drawbacks: it 
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tends to converge (i) very slowly and (ii) only to a local maximum. Indeed, 
the problem of recovering the parameters of a general HMM is provably hard, 
in several distinct senses [Abe and Warmuth, 1992, Lyngs0 and Pedersen, 
2001, Terwijn, 2002]. 

In this paper we consider learning parametric-output HMMs with a fi- 
nite and known number of hidden states, where the output from each hidden 
state follows a parametric distribution from a given family. A notable exam- 
ple is a Gaussian HMM, where from each state x, the output is a (possibly 
multivariate) Gaussian, J\f{iix,^x)-, typically with unknown iJ,x,T,r^. 

Main results. We propose a novel approach to learning parametric output 
HMMs, based on the following two insights: (i) in an ergodic HMM, the 
stationary distribution is a mixture of distributions from the parametric 
family, and (ii) given the output parameters, or their approximate values, 
one can efficiently recover the corresponding transition probabilities up to 
small additive error. 

Combining these two insights leads to our decoupling approach to learn- 
ing parametric HMMs. Rather than attempting, as in the Baum- Welch 
algorithm, to jointly estimate both the transition probabilities and the out- 
put density parameters, we instead learn each of them separately. First, 
given one or several long observed sequences, the HMM output parameters 
are estimated by a general purpose parametric mixture learner, such as the 
Expectation-Maximization (EM) algorithm. Next, once these parameters 
are approximately known, we learn the hidden state transition probabilities 
by solving a computationally efficient convex quadratic program (QP). 

The key idea behind our approach is to treat the underlying hidden pro- 
cess as if it were sampled independently from the Markov chain's stationary 
distribution, and operate only on the empirical distribution of singletons and 
consecutive pairs. Thus we avoid computing the exact likelihood, which de- 
pends on the full sequence, and obtain considerable gains in computational 
efficiency. Under mild assumptions on the Markov chain and on its output 
probabilities, we prove in Theorem 1 that given the exact output probabil- 
ities, our estimator for the hidden state transition matrix is asymptotically 
consistent. Additionally, this estimator is robust to small perturbations in 
the output probabilities (Theorems 2-6). 

Beyond its practical prospects, our proposed approach also sheds light on 
the theoretical difficulty of the full HMM learning problem: It shows that for 
parametric-output HMMs the key difficulty is fitting a mixture model, since 
once its parameters have been accurately estimated, learning the transition 
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matrix can be cast as a convex program. While learning a general mix- 
ture is considered a hard problem, we note that recently much progress has 
been made under various separation conditions on the mixture components, 
see e.g. Moitra and Valiant [2010], Belkin and Sinha [2010] and references 
therein. 

Related work. The problem of estimating HMM parameters from obser- 
vations has been actively studied since the 1970's, see Cappe et al. [2005], Ra- 
biner [1990], Roweis and Ghahramani [1999]. While computing the maximum- 
likelihood estimator for an HMM is in general computationally intractable, 
under mild conditions, such an estimator is asymptotically consistent and 
normally distributed, see Bickel et al. [1998], Chang [1996], Douc and Matias 
[2001]. 

In recent years, there has been a renewed interest in learning HMMs, 
in particular under various assumptions that render the learning problem 
tractable [Farago and Lugosi, 1989, Hsu et al., 2009, Mossel and Roch, 2006, 
Siddiqi et al., 2010, Anandkumar et al., 2012]. Also, Cybenko and Crespi 
[2011], Lakshminarayanan and Raich [2010] recently suggested Non-negative 
Matrix Factorization (NNMF) approaches for learning HMMs. These meth- 
ods are related to our approach, since with known output probabilities, 
NNMF reduces to a convex program similar to the one considered here. 
Hence, our stability and consistency analysis may be relevant to NNMF- 
based approaches as well. 

Paper outline. In Section 2 we present our problem setup. The algorithm 
for learning the HMM appears in Section 3, and its statistical analysis in 
Section 4. Section 5 contains some simulation results. The technical details 
are deferred to the Appendices. 

2 Problem Setup 

Notation. When X £ X and Y y take values in a discrete set we 
abbreviate P{x) for Pr(X = x) and P{y \ x) for Pr(y = y\X = x). When 
Y € y is continuous- valued, we denote by P{y \ x) the probability density 
function of Y given X. 

For x,w £ M", diag(x) denotes the nxn diagonal matrix with entries Xi 
on its diagonal, x/w is the vector with entries Xi/wi, and = Y^iWixj 

is a weighted £2 norm (for Wi > 0). The shorthand x < y means x < 
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(1 + o(l))y. Similarly we write x <p y for x < (1 + op{l))y. Finally, for a 
positive integer n S N, we write [n] = {1, 2, . . . , n}. 

Hidden Markov Model. We consider a discrete-time, discrete-space HMMs 
with n hidden states. The HMM output alphabet, denoted 3^, may be ei- 
ther discrete or continuous. A parametric-output HMM is characterized by 
a tuple {A,J-^,Pq) where ^ is an n x n column stochastic matrix, Pq is the 
distribution of the initial state and F!^ = {fe^, . . . , fe„) is an ordered tuple 
of parametrized probability density functions. In the sequel we sometimes 
write fi instead of fe^ ■ 

To generate the output sequence of the HMM, first an unobserved Markov 
sequence of hidden states x = {xt}^^ is generated with the following dis- 
tribution. 

T-1 

P{x) = Po(a^o) n "^^t'^t-i' 
t=i 

where Aij = P{Xt = i \ Xt-^i = j) are the transition probabilities. Then, 
each hidden state Xt independently emits an observation Yt € y according to 
the distribution P{yt \ xt) = fxtivt)- Hence the output sequence y = {yt)J=Q 
has the conditional probability 

T-1 T-1 

P{y\x) = II P{yt\xt) = Yl fM- 

t=0 t=0 

The HMM Learning Problem. Given one or several HMM output se- 
quences {Yt)JSQ, the HMM learning problem is to estimate both the transi- 
tion matrix A and the parameters of the output distributions J-^. 

3 Learning Parametric- Out put HMMs 

The standard approach to learning the parameters of an HMM is to maxi- 
mize the likelihood 

T-1 

^ Po{xo)P{yo I xq) Y[ Ax^^x^_^P{yt I Xt). 

As discussed in the Introduction, this problem is in general computationally 
hard. In practice, neglecting the small effect of the initial distribution Pq{xo) 
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on the likelihood, A and are usually estimated via the Baum-Welch 
algorithm, which is computationally slow and only guaranteed to converge 
to a local maximum. 

3.1 A Decoupling Approach 

In what follows we show that when the output distributions are parametric, 
we can decouple the HMM learning task into two steps: learning the output 
parameters 9i, . . . ,9^ followed by learning the transition probabilities of the 
HMM. Under some mild structural assumptions on the HMM, this decou- 
pling implies that the difficulty of learning a parametric-output HMM can 
be reduced to that of learning a parametric mixture model. Indeed, given 
(an approximation to) -7^^'s parameters, we propose an efficient, single-pass, 
statistically-consistent algorithm for estimating the transition matrix A. 

As an example, consider learning a Gaussian HMM with univariate out- 
puts. While the Baum-Welch approach jointly estimates n? + 2n parameters 
(the matrix A and the parameters fii,af), our decoupling approach first fits 
a mixture model with only 3n parameters (vTj, /ij, o"?), and then solves a 
convex problem for the matrix A. While both problems are in general com- 
putationally hard, ours has a significantly lower dimensionality for large n. 

Assumptions. To recover the matrix A and the output parameters 9j 
we make the following assumptions: 

(la) The Markov chain has a unique stationary distribution tt over the 
n hidden states. Moreover, each hidden state is recurrent with a frequency 
bounded away from zero: min^ tt^ > oq for some constant oq > 0. 

(lb) The nxn transition matrix A is geometrically ergodic^: there exists 
parameters G < oo and ip £ [0, 1) such that from any initial distribution Pq 

||^*-Po-7r||^ < 2GV'*, VtGN. (1) 

(Ic) The output parameters of the n states are all distinct: 9i ^ 9j for 
i ^ j. In addition, the parametric family is identifiable. 

Remarks: Assumption (la) rules out transient states, whose presence 
makes it generally impossible to estimate all entries in A from one or a 
few long observed sequences. Assumption (lb) implies mixing and is used 
later on to bound the error and the number of samples needed to learn the 

^Any finite-state ergodic Markov chain is geometrically ergodic. 
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matrix A. Assumption (Ic) is crucial to our approach, which uses the distri- 
bution of only single and pairs of consecutive observations. If two states i,j 
had same output parameters, it would be impossible to distinguish between 
them based on single outputs. 

3.2 Learning the output parameters. 

Assumptions (la, lb) imply that the Markov chain over the hidden states is 
mixing, and so after only a few time steps, the distribution of Xf is very close 
to stationary. Assuming for simplicity that already Xq is sampled from the 
stationary distribution, or alternatively neglecting the first few outputs, this 
implies that each observable 1^ is a random realization from the following 
parametric mixture model, 

n 

Y^Y.^'MV)- (2) 

i=l 

Hence, given the output sequence (If)^^ one may estimate the output 
parameters 9i and the stationary distribution tTj by fitting a mixture model 
of the form (2) to the observations. This is commonly done via the EM 
algorithm. 

Like its more sophisticated cousin Baum- Welch, the mixture-learning 
EM algorithm also suffers from local maxima. Indeed, from a theoretical 
viewpoint, learning such a mixture model (i.e. the parameters of J-f^^) is a 
non-trivial task considered in general to be computationally hard. Nonethe- 
less, under various separation assumptions, efficient algorithms with rigorous 
guarantees have been recently proposed (see e.g. Belkin and Sinha [2010]).^ 
Note that while these algorithms have polynomial complexity in sample size 
and output dimension, they are still exponential in the number of mixture 
components (i.e., in the number of hidden states of the HMM). Hence, these 
methods do not imply polynomial learnability of parametric-output HMMs. 

In what follows we assume that using some mixture-learning procedure, 
the output parameters Oj have been estimated with a relatively small error 
(say \6j — 9j\ = 0(1/\/T)). Furthermore, to allow for cases where 9j were 
estimated from separate observed sequences of perhaps other HMMs with 
same output parameters but potentially different stationary distributions, 
we do not assume that vTj have been estimated. 

^Note that the techniques for learning mixtures assume iid data. However, if these 
are algorithmically stable — as such methods typically are — the iid assumption can be 
replaced by strong mixing [Mohri and Rostamizadeh, 2010]. 
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3.3 Learning the transition matrix A 

Next, we describe how to recover the matrix A given either exact or ap- 
proximate knowledge of the HMM output probabihties. For clarity and 
completeness, we first give an estimation procedure for the stationary dis- 
tribution TT. 

Discrete observations. As a warm-up to the case of continuous outputs, 
we start with HMMs with a discrete observation space of size |3^| = m. In 
this case we can replace J-^ by an m x n column-stochastic matrix B such 
that Sfcj = P{k I i) is the probability of observing an output k given that 
the Markov chain is in hidden state i. In what follows, we assume that the 
number of output states is larger or equal to the number of hidden states, 
m > n, and that the m x n matrix B has full rank n. The latter is the 
discrete analogue of assumption (Ic) mentioned above. 

First note that since the matrix A has a stationary distribution tt, the 
process Yt also has a stationary distribution p, which by analogy to Eq. (2), 
is 

p = Btv. (3) 
Similarly, the pair {Yt,Yt+i) has a unique stationary distribution cr, given 

by 

'^k,k' = ^ T^e^i'/Bk/Bk/^ii. (4) 

i/'&[n] 

As we shall see below, knowledge of p and cr suffices to estimate tt and 
A. Although p and a are themselves unknown, they are easily estimated 
from a single pass on the data (yf)^^: 

Pk = ^2^i{yt=fc}, 

t=0 

1 ^"^ 

<7fc,fc' = Y—r^T^{yt-i=k}T^{yt=k'}- (5) 
i=l 

Estimating the stationary distribution tt. The key idea in our ap- 
proach is to replace the exact, but complicated and non-convex likelihood 
function by a "pseudo-likelihood", which treats the hidden state sequence 
(Xt) as if they were iid draws from the unknown stationary distribution 
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TV. The pseudo-hkehhood has the advantage of having an easily computed 
global maximum, which, as we show in in Section 4, yields an asymptotically 
consistent estimator. Approximating the (Xf) as iid draws from tt means 
that the (Yt) are treated as iid draws from p = Biv. Thus, given a sequence 
(Yt)J~Q the pseudo-likelihood for a vector tt is 

T-l m 

i=0 k=l 

where rifc = YlJ=o ^{yt=k} = Tp^. Its maximizer is 

m 

-n-'^'^ = argmin - ^ log(Sx)fc. (6) 

rri>0, ||x||i=l 

Since — log(x) is convex, {Bx)k is a linear combination of the unknown 
variables Xj, and the constraints are all linear, the above is nothing but a 
convex program, easily solved via standard optimization methods [Nesterov 
and Nemirovskii, 1994]. 

However, to facilitate the analysis and to increase the computational 
efficiency, we consider the asymptotic behavior of the pseudo-likelihood in 
(6), for T sufficiently large so that p is close to p. First, we write 

Next, assuming that T 1 is sufficiently large to ensure \ {Bx)k — Pk\ ^ Pk, 
we take a second order Taylor expansion of log(-Bx)fc in (6). This gives 

n n 

-^Pk'^ogpk - ^{{Bx)k - Pk) + 

k=l k=l 

The first term is independent of x, whereas the second term vanishes. Thus, 
we may approximate (6) by the quadratic program 

argmin \\p - Bx\\f^/^^ (7) 

Xi>0, ||2:||i = l 

where = Ylk'^kxl. is a weighted £2 norm w.r.t. the weight vector 

w. Eq. (7) is also a convex problem, easily solved via standard optimiza- 
tion techniques. However, let us temporarily ignore the non-negativity con- 
straints > and add a Lagrange multiplier for the equality constraint 
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k=i j=i j 

Differentiating with respect to Xi yields 

Wx = {l + X)l, (9) 

where W = B'^ diag{l / p)B. Enforcing the normalization constraint is equiv- 
alent to solving for x* = W~^l and normalizing tt = x* / Note that 
if all entries of x* are positive, tt is the solution of the optimization problem 
in (7), and we need not invoke a QP solver. Assumptions (la,lb) that vr^ 
is bounded away from zero and that the chain is mixing imply that for suf- 
ficiently large T, all entries of tt will be positive with high probability, see 
Section 4. 



Estimating the transition matrix A. To estimate A, we consider pairs 
(Yi, It+i) of consecutive observations. By definition we have that for a single 
pair, 

P{Yt = k,Yt+i = k') = Y,Bk'rBkjAjP{Xt = j). 

As above, we treat the T — 1 consecutive pairs {Yt,Yt+i) as independent 
of each other, with the hidden state Xt sampled from the stationary dis- 
tribution TT. When the output probability matrix B and the stationary 
distribution tt are both known, the pseudo-likelihood is given by 

(k,k') ij 

where Ukk' = YlJ=i '^{yt-i=k}^{yt=k'} = {T - l)akk'- The resulting estimator 
is 

argmin - V dtw log ( V Cf/'^^j) (10) 

where C^-^ = TTjBkjBk'i- In practice, since tt is not known, we use Cij' = 
TCjBi^jBi^/i, with TT instead of tt. Again, (10) is a convex program in A 
and may be solved by standard constrained convex optimization meth- 
ods. To obtain a more computationally efficient formulation, let us as- 
sume that miufc^fc/ (Tfc^fc/ > 02 > 0, and that min^ /;/ TiTfcfc/ ^ 1, so that 
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\{CA)kk' - <5"fcfc'| < ^kk', where {CA)kk' = Yjij ^if ^ij- Then, as above, the 



approximate minimization problem is 



argmm 



a-CA 



2 

(11) 

1/<T 



In contrast to the estimation of tt, where we could ignore the non- negativity 
constraints, here the constraints Aij > are essential, since for realistic 
HMMs, some entries in A might be strictly zero. Finally, note that if tt = tt 
and a = cr, the true matrix A satisfies cr = CA and is the minimizer of 
(10). 

In summary, given one or more output sequences {yt)t=Q and an estimate 
of -B, we first make a single pass over the data and construct the estimators 
p and (T, with complexity 0{T). Then, the stationary distribution tt is 
estimated via (9), and its transition matrix A via (11). To estimate A, 
we first compute the matrix product C'^C, with O(n^m^) operations. The 
resulting QP has size n^, and is thus solvable [den Hertog, 1994] in time 
O(n^) — which is dominated by 0{n^m?') since m > n by assumption. 
Hence, the overall time complexity of estimating ^4 is 0(T + 7i^m?). 



Extension to continuous observations. We now extend the above re- 
sults to the case of continuous outputs distributed according to a known 
parametric family. Recall that in this case, each hidden state i € [n] has an 
associated output probability density fg^{y)- As with discrete observations, 
we assume that an approximation (0i,...,0„) to /j's parameters is given 
and use it to construct estimates of tt and A. 

To this end, we seek analogues of (3) and (4), which relate the ob- 
servable quantities to the latent ones. This will enable us to construct 
the appropriate empirical estimates and the corresponding quadratic pro- 
grams, whose solutions will be our estimators tt and A. To handle infinite 
output alphabets, we map each observation y to an n-dimensional vector 
(p{y) = ifdiiy), ■ ■ ■ , fe„{y))-, whose entries are the likelihood of y from each 
of the underlying hidden states. As shown below, this allows us to reduce 
the problem to a discrete "observation" space which can be solved by the 
methods introduced in the previous subsection. 



Estimating the stationary distribution tt. To obtain an analogue of 
(3), we define the vector ^ € R", and matrix K G R"^", which will play the 
role of p and B for discrete output alphabets. The vector ^ is defined as 
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^ = E[93(y)], or more explicitly, 

n „ 

6 ^ nfk{Y)]=Y.TTj fk{y)P{y\j)dy. 



Similarly, the entry of K is given by 



K,, = E[/,(y) \X=j\= [ h{y)P{y \ j)dy. 

Jy 



(12) 



y 

With these definitions we have, as in Eq. (3), 

$, = Kt7. (13) 

Thus, given an observed sequence {yt)J=Q we construct the empirical esti- 
mate 



T-l 



ik = fkiyt) 



(14) 



t=o 



and consequently solve the QP 



TT = argmm 



(15) 



In analogy to the discrete case, we assume rank(il') = n so (15) has a unique 
solution. Its asymptotic consistency and accuracy are discussed in Section 
4. 



Estimating the transition matrix A. Next, following the same paradigm 
we obtain an analogue of (4). Bayes rule implies that for stationary chains, 

Pik\Y) = ^Mp^. (16) 

We define the matrices r] G R"^" and F G M"^" (analogues of cr and B) as 
follows. Let Y and Y' be two consecutive observations of the HMM, then 

7]kk' = B[P{k\Y)P{k'\Y')] 

Fkj = B[P{k\Y)\j] = [ P{k\y)P{y\j)dy. (17) 

Jy 
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A simple calculation shows that, as in (4), 

n 

ilkk' = ^ Fk'iFkjAij-n-j. (18) 

Since here F plays the role of we may call it an effective observation 
matrix. This suggests estimating A with the same tools used in the discrete 
case. Thus, given an observed sequence {yt)J=o we construct an empirical 
estimate f) by 



i 

Vkk 



1 

— Y^P{k\yt.,)P{k'\yt), (19) 
t=i 



where P is given by (16) but with vr replaced by tt. Consequently we solve 
the following QP 



A = argmin 



2 



V-iCA) (20) 

1/t7 



where C^J'' = T^jF^jF^i and {CA)^^! = Y^ij C^J'' Aij . As for the matrix B 
in the discrete case, to ensure a unique solution to Eq. (20) we assume 
rank(F) = n. 

Remark 1. Instead of (18), we could estimate rj'^j^i = ^[fk{Y)fki{Y')\, 
from which A can also he recovered, since 

n 

v'k,k' = X] Kk'iKkjAijTTj. 

This has the advantage that for many distributions the matrix K can he cast 
in a closed analytic form. For example in the Gaussian case, while F needs 
to he calculated numerically, we have 

11 / l(;Ui-^j)M 



Additionally, K does not depend on the stationary distribution. The draw- 
back is that in principle, and as simulations suggest, accurately estimating 
•q' may require many more samples, see Appendix for details. 
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In summary, given approximate output parameters {6i, . . . , On), we first 
calculate the n x n matrix K. Next, we construct the vector ^ by a single 
pass over the data {Yt)JSQ. Then the stationary distribution vr is estimated 
via (15). Given tt, we calculate the nx n matrix F, construct the empirical 
estimate f), and estimate A via (20). As in the discrete observation case, 
the time complexity of this scheme is 0{T + n^) with additional terms for 
calculating K and F. 

4 Error analysis 

First, we study the statistical properties of our estimators under the as- 
sumption that the output parameters, (^i, . . . ,9n) in the continuous case, 
or the matrix B in the discrete case, are known exactly. Later on we show 
that our estimators are stable to perturbations in these parameters. For 
simplicity, throughout this section we assume that the initial hidden state 
Xq is sampled from the stationary distribution tt. This assumption is not 
essential and omitting it would not qualitatively change our results. All 
proofs are deferred to the Appendices. 

To provide bounds on the error and required sample size we make the 
following additional assumptions: 

(2a) In the discrete case, there exists an ai > such that miuj pj > ai. 

(2b) In the continuous case, all fe- are bounded: 

max sup /e, (y) < L < oo. 

«G[n] ygM 

Finally, for ease of notation we define 

2G 

Asymptotic Strong Consistency. Our first result shows that with per- 
fectly known output probabilities, as T — )• oo, our estimates tt, A are strongly 
consistent. 

Theorem 1. Let {Yt)J~Q be an observed sequence of an HMM, whose Markov 
chain satisfies Assumptions (la, lb). Assume rank(i?) = n in the discrete 
case, or rank(F) = rank(i^') = n in the continuous case. Then, both esti- 
mators, TT of (9) and A of (11) in the discrete case, or (15) and (20) in the 
continuous case, are asymptotically strongly consistent. Namely, asT ^ oo, 
with probability one, 

TT — ^ TT and A ^ A. 
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Error analysis for the stationary distribution tt. Recall that to esti- 
mate TV in the discrete case, we argued that for sufficiently large sample size 
T, the positivity constraints can be ignored, which amounts to solving an 
n X n system of linear equations, Eq. (9). The following theorem provides 
both a lower bound on the required sample size T for this condition to hold 
with high probability, as well as error bounds on the difference tt — tt. 

Theorem 2. Discrete case; Let p be given by (5), and tt be the solution 
of (9). Let B = diag(l/y^)i3, and (Ti{B) be its smallest singular value. 
Under Assumption (2a), a sequence of length 



q,i, \/log n , , 

T> ^ ~ , (21) 
^ aoaiai{B) 

is sufficient to ensure that with high probability, all entries in tt are strictly 
positive. Furthermore, as T —)■ oo, 



I---II2<pa/;^t4^- (22) 



TajaliB) 



Next we consider the errors in the estimate tt for the continuous ob- 
servations case. For simplicity, instead of analyzing the quadratic program 
(15) with a weighted £2 norm, we consider the following quadratic program, 
whose solution is also asymptotically consistent: 

min \\$-Kx\\l. (23) 

This allows for a cleaner analysis, without changing the qualitative flavor of 
the results. 

Theorem 3. Continuous case.- Let ^ be given by (I4), ^ be the solution 
of (15), and K = diag(l/v^)i^. Under Assumption (2b), as T ^ 00, 



IIt^-ttL <P (24) 

Error Analysis for the Matrix A. Again, for simplicity, instead of an- 
alyzing the quadratic programs (11) and (20) with a weighted ^2 norm, we 
consider the following quadratic programs, whose solutions are also asymp- 
totically consistent for i> G {(T,/)}: 

min \\v-CA\\l. (25) 
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Note that this QP is apphcable even if I'kk' = for some k,k' , which imphes 
that Pfcfc/ = as weh. 

Theorem 4. Discrete case. Let A be the solution of (25) with i> = a 
given in (5). Then, as T ^ oo, 



A- A 



2 

^4 



T4aial'{B) 



and thus an observed sequence length 



T > 



- aialaf{B) 
suffices for accurate estimation. 



(26) 



(27) 



Theorem 5. Continuous case. Let A be the solution of (25) with v = i] 
given in (19). Then, as T ^ oo, 



A- A 



<p 



I (n7 In n)4L4 
Talaf{F)af{K) 



and thus an observed sequence length 



T > 



(28) 



(29) 



- alal{F)ai{K) 

suffices for accurate estimation. 

Remarks. Note the key role of the smallest singular value ui, in the 
error bounds in the theorems above: Two hidden states with very similar 
output probabilities drive cJi to zero, thus requiring many more observations 
to resolve the properties of the underlying hidden sequence. 



Inaccuracies in the output parameters. In practice we only have ap- 
proximate output parameters, found for example, via an EM algorithm. For 
simplicity, we study the effect of such inaccuracies only in the continuous 
case. Similar results hold in the discrete case. To this end, assume the errors 
in the matrices K and F of Eqs. (12) and (17) are of the form 

k = K + eLQ, F = F + eP, (30) 

with < 1. The following theorem shows our estimators are 

stable w.r.t. errors in the estimated output parameters. Note that if K, F 
are estimated by a sequence of length T, then typically e = 0(r-i/2). 
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Figure 1: The mixture and its components. 



Theorem 6. Given an error of e in the output parameters as in Eq. (30), 
the estimators given in Theorems 3 and 5, incur an additional error of at 
most 



with r = 1 for estimating tt, and r = | for estimating A, and where ci 
is the smallest singular value of K/L'^ when estimating tt, and of F when 
estimating A. 

5 Simulation Results 

We illustrate our algorithm by some simulation results, executed in MAT- 
LAB with the help of the HMM and EM toolboxes^. We consider a toy 
example with n = 4 hidden states, whose outputs are univariate Gaussians, 
J\f{iJ,i,af), with A, and tt given by 




(31) 



A 



/ 0.7 0.0 0.2 0.5 \ /i = AA(-4,4) 
0.2 0.6 0.2 0.0 /2 = AA(0, 1) 

0.1 0.2 0.6 0.0 ' /s = AA(2,36) 

V 0.0 0.2 0.0 0.5 / /4 = AA(4, 1) 
ttT = (0.3529,0.2941,0.2353,0.1176). 



Fig. 1 shows the mixture and its four components. 



''Available at http : //www. cs .ubc . ca/~murphyk and http: //www .mathworks . com/ (un- 
der EM_GM_Fast). 
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To estimate A we considered the following methods: 





method 


initial 9 


initial A 


1 


BW 


random 


random 


2 


none 


exactly known 


QP 


3 


none 


EM 


QP 


4 


BW 


exactly known 


QP 


5 


BW 


EM 


QP 


6 


BW 


exactly known 


random 


7 


BW 


EM 


random 



Fig. 2 (left) shows on a logarithmic scale E||A — ^||^ vs. sample size T, 
averaged over 100 independent realizations. Fig. 2 (right) shows the running 
time as a function of T. In these two figures, the number of iterations of the 
BW step was set to 20. 

Fig. 3 (left) shows the convergence of E||A — AWp as a function of the 
number of BW iterations, with known output parameters, but either with 
or without the QP results. Fig. 3 (right) gives E||^ — j4||^ as a function 
of the number of BW iterations for both known and EM-estimated output 
parameters with 10^ samples. 

The simulation results highlight the following points: (i) BW with a ran- 
dom guess of both A and the parameters 9j = {fij,aj) is useless if run for 
only 20 iterations. It often requires hundreds of iterations to converge, in 
some cases to a poor inaccurate solution (results not shows due to lack of 
space); (ii) For a small number of samples the accuracy of QP+EM (method 
3) is comparable to BW+EM (method 5) but requires only a fraction of 
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Figure 3: Convergence of the BW iterations. 



the computation time, (iii) When the number of samples becomes large, 
the QP+EM is not only faster, but (surprisingly) also more accurate than 
BW+EM. As Fig. 3 suggests, this is due to the slow convergence of the 
BW algorithm, which requires more than 20 iterations for convergence, (iv) 
Starting the BW iterations with (//j,(T?) estimated by EM and A estimated 
by QP as its initial values significantly accelerated the convergence giving 
a superior accuracy after only 20 iterations. These results show the (well 
known) importance of initializing the BW algorithm with sufficiently accu- 
rate starting values. Our QP approach provides such an initial value for A 
by a computationally fast algorithm. 
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6 Appendix 



We now give a detailed account for the theorems stated in section 4. 
6.1 Preliminaries I 

In what follows we use the following notation: For an n x n matrix A, 
vec{A) £ M" is the result of stacking its columns vertically into a single 
long vector. Thus, its Frobenius matrix norm is ||^||^ = ||vec(A)||2. 
Recall the definition of g^: 

2G 

One can easily verify that for 2G > 1, we have 1 + ijjg^ < g^. Also recall 
that assumption (2b) states that the distributions in J^^ are bounded by L, 
which is defined by: 

maxsup/g. (y) < L < oo. 

The following concentration result from Kontorovich and Weiss [2012, 
Theorem 1] is our main tool in proving the error bounds given here. 

Lemma 1. Let Y = Yq,. . . G be the output of a Hidden Markov 

chain with transition matrix A and output distributions J-^. Assume that A 
is geometrically ergodic with constants G,ip as in (1). Let F : (Yq, . . . , Y^-i) i— )• 
M be any function that is l-Lipschitz with respect to the Hamming metric on 
y'^. Then, for all e > 0, 

P(|F(y) -EF| > eT) < 2exp . (32) 

We will also need the following Lemma (proved in [Kontorovich and 
Weiss, 2012] for the discrete output case but easily generalize to continuous 
outputs) for bounding the variance of our estimators. 

Lemma 2. Let f{y) : M. — )• be a function of the observables of an n 
states geometrically ergodic HMM with constants (G, ip) and 

f{y)dy < 1. 

'y 

Assume the HMM is started with the stationary distribution n. Then 

T-l 



Var 



T 

t=o 



^ Var[/(y)] ^ i;g^B[f{Y)] 
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Similarly, let g{y,y') : M x 
tions (y, y') such that 



be a function of consecutive observa- 



Then 



Var 



Jj^giy,y')dydy' < 1. 



T-l 



t=i 



^ Var[g(y,rO] 
T-l 

(i + V^g^)E[g(y,yO] 

T-l 



6.2 Accuracy of p, ct, ^ and t) 

Since our estimators tt and A are constructed in terms of p and & in the 
discrete case, and ^ and in the continuous case, let us first examine the 
accuracy of the later. The following results shows that geometric ergodicity 
is sufficient to ensure their rapid convergence to the true values. 

Lemma 3. Discrete case. Let {yt)J=i be an observed sequence from a dis- 
crete output HMM whose initial state Xq follows the stationary distribution 
TT. Let p be given by (3) and a by (4) with their empirical estimates given 
in (5). Then 



E[||p-p||2] < 



Ef||6- - crl 



< 



1 + 



T-l 



Furthermore, for any e > 



^'(IIP-P||2> V^ + ^)^2exp 



and 



P\\\a-aL>J'-±^+e]< 



T-l 



(33) 
(34) 

(35) 
(36) 
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Finally, we have for any fixed v G M™ with ||a 
P(|(p,v)-(p,v)| >e) <2exp 



22V 

4 



(37) 



Proof. First note that w.r.t the Hamming metric, r| |p— p| I2 and | (p, v) — (p, ^ 
are 1-Lipschitz and T\\a- — cr\\2 is 2-Lipschitz. Thus the claims in (35, 36, 
37) all follows directly from Lemma 1 where for (35, 36) we also take into 
account (33) and (34) respectively. In order to prove (33) note that 



E[ 



PII2] 



k&[n] 



Pkf 



^ Var{pk). 
fee[n] 



So by taking in Lemma 2, f{y) = 1y=k, we have E[lj^=fc] = pk and Var{'ly=k) 
Pk{^ — Pk) 1^ Pk- Since X^^Li Pfc = 1 we get the desired bound. 

The bound in (34) is obtained similarly by taking g{y,y') = ly=fclj/'=fc' 
in Lemma 2 with the fact that ^kk' '^kk' = 1- D 

Lemma 4. Continuous case. Let {Yt)]Li be an observed sequence from a 
continuous observations HMM whose initial state Xq follows the stationary 
distribution tt. Let ^ be given by (13) , rj by (18) and ^ and i) be their 
empirical estimates, given by (I4) and (19) respectively. Then for any 
e > , 



> e ) < 2n exp 



and 



1-^ 



P{\\f)-ri\\^>e) < 



2n exp 



2Te' 



(38) 



(39) 



2(r- l)e^ 



Proof. Note that E^^ = and TS^k is L-Lipschitz for all G [n]. Thus by 
Lemma 1 and the union bound we have 



P 



> e ) < 2nexp 



2Te^ 



/2 



(40) 



Since 
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we have 



P 



> e . 



putting e' = e/ \/n in (40), the claim in (38) follows. 

The proof of (39) follows the same paradigm as the proof for (40) . Indeed 
= and T%fc' is 1-Lipschitz so by Lemma 1 and the union bound 
we have 



Since 



^(ll^-^lloo >e') <2n'exp 



2Te' 
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\\V-V\\l= ^ {flkk' - r]kk'f < \\f] - -nWl^ , 

fc,fc'e[n]x [n] 

we have 

^(11^-^112 > e) < ^(^ll^-^lloo > «)• 
putting e' = e/n in (41), the claim in (39) follows. 



(41) 



□ 



6.3 Proof of theorem 1 - Strong consistency 

We now prove the strong consistency of our estimators stated in Theorem 
1. 

Proof. For the discrete case, by Lemma 3, the expectation E[||p — p\\^ goes 
to zero as T — oo. Furthermore, using the Borel-Cantelli lemma, \\p — p\\2 
converge to its expectation a.s. concluding that p converges a.s. to p. The 
same argument goes for a, ^, f] and cr, ^, ij respectively. 

Now, the function / : R™ ^ R" given by /(x) = {B'^ diag{l/x)B)-'^l is 
continuous on M™. Moreover, f{p) = tt since the optimization problem (7) 
has a unique minimizer x* for all p, which in particular is given by j;* = tt 
when p = p. Since p G M!p by assumption, the argument above shows that 
almost surely, p € for all sufficiently large T. Therefore, lim2"_s.oo /(p) = 
f{p) = TV almost surely, and the asymptotic strong consistency of tt is 
established. 

To prove the asymptotic strong consistency of A in the discrete case, 
recall that the minimizer of the quadratic program x'^Kx — K^x subject to 
Gx < g, Dx = d, is continuous under small perturbations of K,h,G,D,d 
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[Dantzig et al., 1967]. In particular, if tt is sufficiently close to tt then A is 
close to A. Since tt — t- tt and (t — t- cr almost surely, we also have ^-^^4. 

For the continuous observations case, note that vr and A are also solutions 
of quadratic programs. Also note that ^ — )• ^ and fj ^ rj almost surely. Thus 
we have that A-^^A and tt-^tt as above. □ 



6.4 Proof of Theorem 2: Bounding the error for n in the 
discrete observations case 

Proof. Lemma 3 and the fact that — p||oo < — p||2 implies that \\p — 
pIIoo = Op{l/\/T). Hence we make a change of variables, 

1 



p = p-\ — ^C- 



(42) 



To establish the (eventual) positivity of the entries of tt, we consider the 
solution X* of (8) with A = 0, e.g. without the normalization = 1, and 
write it as X* = tt + 5. Our goal is to understand the relation between 6 and 

C- 

Observe that 6 satisfies the system of linear equations 



BkjBki 



Pk [1 + ^ 



]_Cfc 
pk 



We need T sufficiently large so that, with high probability, max/^. -4^7r ^ 1? 

or equivalents, \pk- Pk\ <- Pk- 
By taking T > Ag^/a\ we have 

E[||p-p||J <ai/2. 

So choosing e = min/3fc/2 > ai/2 in (35), this condition is satisfied for 
T > g'^/af. Then, approximating 1/(1 + e) = 1 — e + O(e^) gives 



E 



E 



Pk 



1 Ck 



T Pk 
l + Op 



(ttj + 5j) 



Note that since Btz = p, the leading order correction for 6 is simply 



1 



{B^B)-'B^ ( - ) + 0p 
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where the matrix B = diag(l/y^)i?. 

Let {uj} and {vj} be the right and left singular vectors of B with non- 
zero singular values cri{B), where ai < a2 ■ ■ ■ < CTn] thus, Bui = aiVi. The 
fact that B also has n non-zero singular values follows from its definition 
combined with our Assumption 2d that B has rank n. Then 

B'B = Y,^f^^< (43) 

i 

and hence, 

^5;l(i,v.)u.+0p(i) (44) 



For the solution x to have strictly positive coordinates we need that \5j\ < iTj 
for each of j = 1, . . . ,n. Without loss of generality, assume that tti = miuj ttj 
and analyze the worst-case setting. This occurs when the singular vector 
ui with smallest singular value coincides with the standard basis vector ei. 
Then, 

l^il < -T^^m^. KC, vi)| + Op (i) . (45) 

It follows from (37) that \5i\ will be dominated by minvrj > provided 
that 

r>^^. (46) 

In the unlikely event that (i) the vector tt is uniform (vTj = 1/n for all 
j), (ii) the matrix B has n identical singular values, we need the equation 
analogous to (45) to hold for all n coordinates. By a union bound argument, 
an additional factor of log n in the number of samples suffices to ensure, with 
high probability, the non-negativity of the solution x. 

Next we proceed to bound ||7r — vrHg. To this end, we write 



Since both the {uj} and the {vj} are orthonormal, 

ml - Y.i^^-^ 

a({B){mm pkY ^ 



< 



l|2 



al{B)al 
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1 1 2 

Bounding \\p — p\\2 via Lemma 3 and noting that 



TT 



^<2||x*-7r||2 = 2||5||2, 



ll^'lli 

the result in (22) fohows. □ 
6.5 Preliminaries II 

The remaining estimators (vr for the continuous observations case, and A for 
both the discrete and continuous observations cases) are obtained as solu- 
tions for quadratic programs. Let us take for example the QP for calculating 
TT with continuous observations HMM, given in (23). For this case, the QP 
is equivalent to 

TT = argmin —x^K^Kx — x^ K^^ 
X 2 

subject to a; > and Xi = 1. 

Note that if ^ was equal to its true values ^, the solution of the above 
QP would simply be the true tt. In reality, we only have the estimate ^. 
In order to analyze the error ||7r — vrUg, we will need to consider how the 
solutions of such a quadratic program are affected by errors in ^. 

More generally, we are concerned with two QPs 

m.inQ{x) = min —x^ Mx — x^ h, (47) 
m.mQ{x) = min —x^ Mx — x^ h, (48) 

both subject to Gx < g, Dx = d. We assume that the solution to the first 
QP is the "true" value while the solution to the second is our estimate. So 
bounding the estimate error is equivalent to bounding the error between the 
solutions obtained by the above two QPs, where M and h are perturbed 
versions of M and h. 

Given that, note that only the objective function has been perturbed, 
while the linear constraints remained unaffected. We may thus apply the 
following classical result on the solution stability of definite quadratic pro- 
grams. 

Theorem 7. [Daniel, 1973] Let A = Amin(^^) the smallest eigenvalue of 
M, and let e = max{||M — M\\2, \\h — h\\2}- Let x and x be the minimizers 
of Eqs.(47) and (48), respectively. Then, for e < A, 

\\X - X\\2 < -— (1 + Iklb)- 

A — e 
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In the following we will obtain bounds on e and A for the different esti- 
mators and invoke the above theorem. 



6.6 Proof of Theorem 3: Bounding the error for n in the 
continuous observations case 

Proof. Note that in the notation given in Theorem 7, we have h = ^ K and 
h = ^ K. Since we assumed that the output density parameters are known 
exactly we have no error in M = K K. 
It is immediate that 



and 



e < 



\K\\^ < nL 



From Lemma 4 we have 



<P 



{nlnn)g'^L'^ 
f '■ 



while by Theorem 7 we have 



|7r-7r||2 < 



XrraniK'K) 



Since ||7r||2 < 1, the claim follows. 



□ 



As a side remark we note that the form of (24) is somewhat counter- 
intuitive, as it suggests a worse behavior for larger L. Intuitively, how- 
ever, larger L corresponds to a more peaked — and hence lower-variance 
— density, which ought to imply sharper estimates. Note however that as 
numerical simulations suggest we typically have 



al{k) 



0(1). 



Thus, whenever o'\{F) is well behaved so is the estimate in (24) and the 
bound is reasonable after all. Finally note that F is stochastic so it behaves 
very much like the matrix B in the discrete outputs case. 
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6.7 Proof of Theorem 4: Bounding the error of A in the 
discrete observations case 

Let A be the solution of 

mill ||<5- - CA\\l, (25) 
Aii>0,E,A.,=l 

where a is given in (5). Recall that C^J'' = iTjBkjBk'i and C^J'' = TTjBkjBk'i- 
First note that if tt and cr were known exactly, the above QP could be written 
as 

m.inQ{A) = min ^ vec(A)^M vec(^) - vec(^)^/i (49) 

where M = C'' C and h = vec(<T). Its solution is precisely the transition 
probability matrix A. In reality, as we only have estimates tt and <t, the 
optimization problem is perturbed to 

min(5(A) = min^ vec(A)TMvec(^) - vec(A)^/i (50) 

where M = C, and h = C vec((T). 

To analyze how errors in a and C affect the optimization problem we 
follow the same route as above. Thus we need to bound — /i||2, — 
M||2, and the smallest eigenvalue of M. Regarding the latter, by definition, 
Amin(-^) = cr? (C*); where cji{C) is the smallest singular value of C. A simple 
exercise in linear algebra yields 

oi{C) > aoafiB). (51) 

The following lemma provides bounds on ||M — M||2 and on — /i||2. 

Lemma 5. Asymptotically, as T —t- oo, 

\\h - h\\2 <P Vn{\\n - 7v\\2 + ||ct - o-lla) (52) 

and 

Il^-^'^ll2<p2n||7r-7r||2. (53) 

Proof. By definition, hij = J2k,k' ^if^kk', and hij = J2k,k' C^fakk'- Using 
the definitions of C and C, up to mixed terms 0(||7r — ttIIooIIct — cHoo), we 
obtain 



hij - hij = (TTj - TTj) ^ BkjB^ 



k'i<7kk' 



kk' 



^ BkjBk'iiakk' — CFkk' 



kk' 
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Since each of ||7r — 7r||^ and ||(t — cr||^ are Op{l/yT), the neglected mixed 
terms are asymptoticahy neghgible as compared to each of the first two ones. 
Next, we use the fact that akw < li T^j < 1 and Ylkk' ^kjBk'i < 1 to obtain 
that 



h-h 



<P \fn IItt — 7r||2 + ^Jn ||vec(<T) — vec(cr) 



12 

Similarly, we have that for the r? x matrix M, and not including higher 
order mixed terms (tTj — 7rj)(-n"^ — tt^), which are asymptotically negligible. 



(M - M)ij^ap = (ttj - 7rj)7r/3 ^ BkjBkpBk'iBk 

kk' 



k'a 



kk' 



Note that ^Ifcfc' BkjBkpBu^iBk'a = {Y.k BkjBkp){Y.k' Bk'iBk'a) < 1- Hence, 
by similar arguments as for h, (53) follows. □ 

We can now prove Theorem 4: 

Proof, (of Theorem 4) Lemma 3, together with (22), implies that with 
high probability, 



4 



1' 



and 



TT — TT 



2 



4 



ra?a?(S) 

Inserting these into (52) and (53) yields, w.h.p., 



max • 



h-h 



M - M 



< 



TalaliB) 



By Theorem 7, we have that 



A- A 



< 



Ai(M) 



(54) 



(55) 



where ||^||_p < ^/n since A is column-stochastic. The claim follows by 
substituting the bounds on e in (54) and on Ai(M) = o"^(C) > aQaf{B) in 
(51) into (55) and noting that afiB) > afiB). □ 
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6.8 Proof of Theorem 5: Bounding the error of A in the 
continuous observations case 

Let A be the solution of 

min \\f}-CA\\l, (25) 

where f] is given in (19) and C^J'' = iijFkjFk'i and Cj'j'' = TTjFkjFkn. The 
above QP can be written as 

miiiQ{A) = min^ vec(A)TMvec(^) - vec(A)^/i (56) 

where M = & C, and h = & vec((T). 

Exactly as in the previous subsection, we want to bound the difference 
between the solutions for the above QP and the unperturbed one. 

First note that 

fTi(C) > a^aliF). (57) 
Next we give the analogue of lemma 5. 
Lemma 6. Asymptotically, as T ^ oo, 

11^ - h\\2 <P Vn ( — - 7r||2 + \\V - vh] (^8) 

and 

\\M-M\\2 <P 2 J'^~'^^^\ (59) 
ao 

Proof. In contrast to Lemma 5, here F is also perturbed due to errors in tt 
with 

JyYlk^kfkiy) 

Expending the difference AFjj = 
find that 



up to first order in tt — tt we 



IIAFII^ < II" -"II- ||F||^ < ^"ll"-"ll-, 
ao ao 

where in the last inequality we used the fact that F is stochastic. Repeating 
the arguments in the proof for Lemma 5 and noting that ao ^ 1 we get (58) 
and (59). □ 
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We now come to the proof of Theorem 5. 



Proof, (of Theorem 5) Lemma 4, together with (24), imphes that with 
high probabihty, 



Iv-vWf 



and 



I2 



(n2 lnn)c/2 
T- 1 ' 



TafiK) 



Inserting these into (58) and (59) yields, w.h.p., 



max • 



h-h 



M- M 



< 



Tafik) 



By Theorem 7, we have that 



(60) 



A- A 



< 



F ~ Ai(M) 



:i + ii^ii 



(61) 



where ||^||_f < ^/n since A is column-stochastic. The claim follows by 
substituting the bounds on e in (60) and on Ai(M) = cTf(C) > agcrf (F) in 
(51) into (61) and noting that crf(F) > crf(F). □ 

As for remark 1, we point out that estimating rj' with the help of the 
matrix K (instead of 77 with F) results in an estimator that is not 0{1/T)- 
Lipschitz any more but 0(L^ /r)-Lipschitz with L = maxjg[„] sup^gg foiiy)- 
This means that in principle we will need many more samples to accurately 
estimate rj' compared to rj, see Lemma 4. Thus, since in high dimensions 
calculating F via numerical integration may be computational intensive, 
choosing between the two estimators is in some sense choosing between 
working with limited number of samples and computational efficiency. 



6.9 Proof of Theorem 6: Perturbations in the output param- 
eters 

We give here the proof for the perturbation in the matrix F. The proof for 
perturbations in the matrix K is similar. 
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Proof. By definition, bij = Y.k,k' ^ij ^kk' , and bij = '}2k,k' ^ij ^kk'- Using 
the definitions of C and C, up to first order in {||7r — 7r||^ , \\a — cr\\^ , ep} 
we obtain 



bij bij 



(^j ~ "^i) X/ ^kjBk'iCTkk' 
kk' 

+T^j ^ BkjBk'iiakk' — CTkk') 

kk' 

X] (^kjBk'i + BkjPk'i) (Jkk'- 



kk' 



As the two first terms already considered we focus on the last term. It 
can be shown that: 



^ I VTj ^ PkjBk'iC^kk' I < n \\P\ 

ij \ kk' J 



Thus 



6-6 



< "v/n (11^ ~ '''"II2 + lbec((5") — ?;ec((T)||2 + 
+2eF||P||^) (1 + 0(1)). 

Similarly, for the matrix K up to first order in {||7r — 7r||^ , e^?} we have 
{k - K)ij^ai3 = (ttj - 7rj)7r/3 ^ BkjBkisBk'iB^ 

kk' 

+ {-^13 - -^i3)-^j ^ BkjBkpBk'iBk 

kk' 

+ epT^jirfs ^ PkjBkpBk'iBk'a + ■ 

kk' 

+ (^F'^/3'^j ^ BkjBkpBk'iPk'a- 



(62) 



kk' 



Again considering only the terms including P and using the facts that 
Efc BkjBkis < 1 and Y.kk'{PkjBk'iY <Y.k ^kj similarly find that 



K -K 



< (1 + Op(l))2n (llvr - 7r||2 + Aep \\P\\f) ■ 



Repeating the analysis in the proofs for Theorems 3, 4 and 5 give the 
desired result. □ 
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